Frontiers runs a number of open access journals in several scientific fields. Authors can submit their articles for publication to one of these journals. However, in some cases the authors may not be aware of the journal that best matches the scope of their paper. Choosing the wrong journal may result in delays or even rejection. To address this, we are developing a feature that suggests to authors the three journals most relevant to their manuscript.
You are tasked with building a text classifier for this feature that, given some input text, recommends the most suitable Frontiers journals for it.
You have at your disposal a .jsonl file containing:
Please email your solution in .zip format to davide.fiocco@frontiersin.org and be prepared to discuss it in the next interview stage.
This report is divided into the following sections:
The task is to develop an algorithm that, given a scientific paper (or a simple text or report), recommends the most suitable Frontiers journals. Several methodologies could be used to build such a recommendation system and classifier, but the choice strongly depends on the number of classes (Frontiers journals) to be predicted. A previous study (Meijer et al. Document Embedding for Scientific Articles: Efficacy of Word Embeddings vs TFIDF. 2021) already compared document embeddings built with TFIDF and word embeddings for the classification of a huge dataset of scientific papers (70 million) into 30 thousand distinct journals or conferences.
Here, I develop several variations of document embedding using:
From the text defined before I tested several embedding strategies such as:
This section is divided into:
Simple import of all needed libraries
import os
path = "/".join(os.getcwd().split("/")[:-1])
os.chdir(path)
from src.utils.utils import load_data, IO
from src.preprocess.preprocess import filter_papers_min_sample, preprocess
from src.train.document_approach import create_embeddings_document
from src.train.train import (
    train_test,
    train_embeddings_keyword_word2vec,
    train_embeddings_document_word2vec,
    train_embeddings_keyword_tfidf,
    train_embeddings_document_tfidf,
    train_embeddings_keyword_sbert,
    train_embeddings_document_sbert,
)
from src.evaluate.evaluate import (
    evaluate_document_word2vec,
    evaluate_keyword_word2vec,
    evaluate_keyword_tfidf,
    evaluate_document_tfidf,
    evaluate_keyword_sbert,
    evaluate_document_sbert,
)
from sklearn.manifold import TSNE
import numpy as np
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
import plotly.express as px
The EDA shows:
The definition of the preprocessing of the text;
The definition of the keywords and how to extract them;
# Load of the dataset
df = load_data()
# Print of one example
df.head(5)
| | id | text | journal |
|---|---|---|---|
| 0 | 465950 | \n Sleep Characteristics and Influencing Facto... | Frontiers in Medicine |
| 1 | 483526 | A Hybrid Approach for Modeling Type 2 Diabetes... | Frontiers in Genetics |
| 2 | 482500 | \n Relationship Between SES and Academic Achie... | Frontiers in Psychology |
| 3 | 437333 | Environmental Health Research in Africa: Impor... | Frontiers in Genetics |
| 4 | 486515 | \n 3,5-T2—A Janus-Faced Thyroid Hormone Metabo... | Frontiers in Endocrinology |
# 1. The number of scientific articles published for each Frontiers journal;
documents_per_journal, df_subset = filter_papers_min_sample(df)
documents_per_journal = documents_per_journal.reset_index().rename(columns={0:"count"})
fig = px.bar(documents_per_journal, x='journal', y='count')
fig.show()
Conclusion 1: The number of published articles per journal is strongly unbalanced. To evaluate the methods presented in the next section, I restrict the original dataset to the journals with at least 2 publications.
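As a rough illustration of the filter described in Conclusion 1, the sketch below counts the papers per journal and keeps only the journals with at least `min_samples` papers. It is not the implementation of `filter_papers_min_sample`: the function name `filter_min_samples`, the record layout, and the toy data are assumptions made for the example.

```python
from collections import Counter


def filter_min_samples(records, min_samples=2):
    """Keep only records whose journal has at least `min_samples` papers.

    `records` is a list of dicts with a "journal" key (assumed layout).
    Returns (counts_per_journal, filtered_records).
    """
    counts = Counter(r["journal"] for r in records)
    kept = [r for r in records if counts[r["journal"]] >= min_samples]
    return counts, kept


# Toy data, invented for illustration only.
papers = [
    {"id": 1, "journal": "Frontiers in Genetics"},
    {"id": 2, "journal": "Frontiers in Genetics"},
    {"id": 3, "journal": "Frontiers in Medicine"},
]
counts, kept = filter_min_samples(papers, min_samples=2)
print(counts.most_common())
print([p["id"] for p in kept])
```

A threshold of 2 is the smallest value that lets every remaining journal appear in both the training and the test set.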
# 2. The distribution of the length of the text;
df_subset["len_text"] = df_subset["text"].apply(len)
fig = px.histogram(df_subset, x="len_text", nbins=100)
fig.show()
Conclusion 2: The distribution of text lengths does not look normal because of its tail (it looks closer to a binomial distribution). However, the sample is too small to say for certain.
# 3. The definition of the train and test split;
df_train, df_test = train_test(df_subset)
Conclusion 3: The test set is 33% of the data.
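A 33% split over strongly unbalanced classes can be sketched as below. This is only an illustration, not the code of `train_test`: it assumes the split is stratified by journal, which the report does not state, and the function name, the seed, and the toy records are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(records, test_size=0.33, seed=42):
    """Split records into train/test, journal by journal.

    Every journal keeps at least one paper in the training set, and every
    journal with 2 or more papers sends at least one paper to the test set.
    """
    rng = random.Random(seed)
    by_journal = defaultdict(list)
    for record in records:
        by_journal[record["journal"]].append(record)

    train, test = [], []
    for journal in sorted(by_journal):
        group = by_journal[journal][:]
        rng.shuffle(group)
        # Round the test share, but never take the whole group.
        n_test = min(len(group) - 1, max(1, round(len(group) * test_size)))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy data, invented for illustration only.
records = [{"id": i, "journal": "Frontiers in Genetics"} for i in range(6)]
records += [{"id": i, "journal": "Frontiers in Medicine"} for i in range(6, 9)]
train, test = stratified_split(records)
print(len(train), len(test))
```

Under this assumption, the minimum of 2 papers per journal from Conclusion 1 is what makes it possible to place each journal on both sides of the split.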
# 4. Preprocessing of the train and test sets
df_train = IO(filename="df_train_preprocessed",folder="02_intermediate",format_="pickle").load()
df_test = IO(filename="df_test_preprocessed",folder="02_intermediate",format_="pickle").load()
df_train[["id","text","preprocessed_text","journal"]].head(5)
| | id | text | preprocessed_text | journal |
|---|---|---|---|---|
| 0 | 494570 | \n \n Low Testosterone in Adolescents & Young ... | testosterone adolescents young adults jordan c... | Frontiers in Endocrinology |
| 1 | 483146 | \n Dynamics and Outcome of Macrophage Interact... | dynamics outcome macrophage interaction salmon... | Frontiers in Cellular and Infection Microbiology |
| 2 | 493402 | \n Oral Treatments With Probiotics and Live S... | oral treatments probiotics live salmonella vac... | Frontiers in Microbiology |
| 3 | 475909 | \n A Systematic, Regional Assessment of High M... | systematic regional assessment high mountain a... | Frontiers in Earth Science |
| 4 | 508059 | \n Hot Water Extracted and Non-extracted Willo... | water extracted extracted willow biomass stora... | Frontiers in Energy Research |
# 5. The definition of the keywords and how to extract them;
df_train[["id","text","keywords","journal"]].head(5)
| | id | text | keywords | journal |
|---|---|---|---|---|
| 0 | 494570 | \n \n Low Testosterone in Adolescents & Young ... | [testosterone, obesity, diabetes, adolescence,... | Frontiers in Endocrinology |
| 1 | 483146 | \n Dynamics and Outcome of Macrophage Interact... | [S . Typhimurium, S . Gallinarum, S . Dublin, ... | Frontiers in Cellular and Infection Microbiology |
| 2 | 493402 | \n Oral Treatments With Probiotics and Live S... | [probiotics, poultry, intestine, neurochemical... | Frontiers in Microbiology |
| 3 | 475909 | \n A Systematic, Regional Assessment of High M... | [digital elevation model (DEM), Himalaya, geod... | Frontiers in Earth Science |
| 4 | 508059 | \n Hot Water Extracted and Non-extracted Willo... | [willow biomass, hot water extraction, bioener... | Frontiers in Energy Research |
Conclusion 5: Each scientific paper is published with a list of keywords that identify the article. I define a simple rule that extracts the keywords from the text when they are located between the words "Keywords:" and "Citation:". When this rule-based extractor fails, I apply the TextRank algorithm to extract the top 5 keywords from the text.
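The rule in Conclusion 5 can be sketched with a regular expression. This is a minimal illustration rather than the project's extractor: it assumes the keywords are comma-separated, the function name `extract_keywords` is hypothetical, and TextRank is not reimplemented here but represented by an optional `fallback` callable.

```python
import re

# Capture everything between "Keywords:" and "Citation:" (the rule from the text).
KEYWORDS_PATTERN = re.compile(r"Keywords:(.*?)Citation:", re.DOTALL)


def extract_keywords(text, fallback=None, top_k=5):
    """Return the author keywords of a paper, or fallback keywords.

    `fallback` is any callable (text, top_k) -> list of str; in the report
    this role is played by TextRank, which is not reimplemented here.
    """
    match = KEYWORDS_PATTERN.search(text)
    if match:
        # Assumption: keywords are separated by commas.
        keywords = [k.strip() for k in match.group(1).split(",")]
        keywords = [k for k in keywords if k]
        if keywords:
            return keywords
    return fallback(text, top_k) if fallback else []


# Toy text, invented for illustration only.
sample = "Some abstract. Keywords: testosterone, obesity, diabetes Citation: Doe J (2020)"
print(extract_keywords(sample))
```

When the markers are missing, the function returns whatever the fallback produces, or an empty list if no fallback is given.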
To analyze and compare the results I used five metrics:
Furthermore, I define a baseline model. This should be the simplest possible model. As the classes are strongly unbalanced, I define a model that always predicts the three journals with the highest number of publications. The results show:
baseline_model_performance = IO(filename="evaluation_baseline",folder="05_report",format_="json").load()
print(f'Average accuracy: {baseline_model_performance["accuracy_total"]}; MRR: {baseline_model_performance["mean_reciprocal_rank"]}')
print(baseline_model_performance["precision_recall_f1score"])
Average accuracy: 0.26; MRR: 0.16
precision recall f1-score support
Frontiers for Young Minds 0.00 0.00 0.00 6
Frontiers in Aging Neuroscience 0.00 0.00 0.00 10
Frontiers in Applied Mathematics and Statistics 0.00 0.00 0.00 2
Frontiers in Artificial Intelligence 0.00 0.00 0.00 1
Frontiers in Astronomy and Space Sciences 0.00 0.00 0.00 1
Frontiers in Behavioral Neuroscience 0.00 0.00 0.00 7
Frontiers in Big Data 0.00 0.00 0.00 1
Frontiers in Bioengineering and Biotechnology 0.00 0.00 0.00 18
Frontiers in Blockchain 0.00 0.00 0.00 2
Frontiers in Built Environment 0.00 0.00 0.00 3
Frontiers in Cardiovascular Medicine 0.00 0.00 0.00 4
Frontiers in Cell and Developmental Biology 0.00 0.00 0.00 15
Frontiers in Cellular Neuroscience 0.00 0.00 0.00 9
Frontiers in Cellular and Infection Microbiology 0.00 0.00 0.00 15
Frontiers in Chemistry 0.00 0.00 0.00 23
Frontiers in Communication 0.00 0.00 0.00 1
Frontiers in Computational Neuroscience 0.00 0.00 0.00 5
Frontiers in Earth Science 0.00 0.00 0.00 9
Frontiers in Ecology and Evolution 0.00 0.00 0.00 11
Frontiers in Education 0.00 0.00 0.00 4
Frontiers in Endocrinology 0.00 0.00 0.00 17
Frontiers in Energy Research 0.00 0.00 0.00 5
Frontiers in Environmental Science 0.00 0.00 0.00 3
Frontiers in Forests and Global Change 0.00 0.00 0.00 3
Frontiers in Genetics 0.00 0.00 0.00 33
Frontiers in Human Neuroscience 0.00 0.00 0.00 8
Frontiers in Immunology 1.00 1.00 1.00 61
Frontiers in Integrative Neuroscience 0.00 0.00 0.00 2
Frontiers in Marine Science 0.00 0.00 0.00 21
Frontiers in Materials 0.00 0.00 0.00 6
Frontiers in Mechanical Engineering 0.00 0.00 0.00 2
Frontiers in Medicine 0.00 0.00 0.00 14
Frontiers in Microbiology 0.11 1.00 0.20 79
Frontiers in Molecular Biosciences 0.00 0.00 0.00 6
Frontiers in Molecular Neuroscience 0.00 0.00 0.00 8
Frontiers in Neural Circuits 0.00 0.00 0.00 2
Frontiers in Neuroanatomy 0.00 0.00 0.00 1
Frontiers in Neuroinformatics 0.00 0.00 0.00 1
Frontiers in Neurology 0.00 0.00 0.00 25
Frontiers in Neurorobotics 0.00 0.00 0.00 3
Frontiers in Neuroscience 0.00 0.00 0.00 29
Frontiers in Nutrition 0.00 0.00 0.00 3
Frontiers in Oncology 0.00 0.00 0.00 43
Frontiers in Pediatrics 0.00 0.00 0.00 14
Frontiers in Pharmacology 0.00 0.00 0.00 57
Frontiers in Physics 0.00 0.00 0.00 11
Frontiers in Physiology 0.00 0.00 0.00 35
Frontiers in Plant Science 0.00 0.00 0.00 48
Frontiers in Psychiatry 0.00 0.00 0.00 28
Frontiers in Psychology 1.00 1.00 1.00 73
Frontiers in Public Health 0.00 0.00 0.00 10
Frontiers in Robotics and AI 0.00 0.00 0.00 5
Frontiers in Sociology 0.00 0.00 0.00 1
Frontiers in Sports and Active Living 0.00 0.00 0.00 3
Frontiers in Surgery 0.00 0.00 0.00 3
Frontiers in Sustainable Food Systems 0.00 0.00 0.00 4
Frontiers in Synaptic Neuroscience 0.00 0.00 0.00 1
Frontiers in Systems Neuroscience 0.00 0.00 0.00 3
Frontiers in Veterinary Science 0.00 0.00 0.00 15
accuracy 0.26 833
macro avg 0.04 0.05 0.04 833
weighted avg 0.17 0.26 0.18 833
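The baseline and the two summary metrics printed above can be sketched in a few lines. This is an illustration under assumptions, not the code behind `evaluation_baseline`: the function names are hypothetical, accuracy is taken to mean that the true journal appears among the three recommendations, and MRR scores 0 when it does not. These definitions are consistent with the reported numbers: the three journals with the largest support (79, 73 and 61 papers) cover 213 of the 833 test papers, about 0.26, and ranking them by support gives an MRR of about 0.16.

```python
from collections import Counter


def top_k_baseline(train_labels, k=3):
    """Return the k most frequent journals in the training labels."""
    return [journal for journal, _ in Counter(train_labels).most_common(k)]


def top_k_accuracy(y_true, ranked_predictions):
    """Fraction of papers whose true journal appears among the recommendations."""
    hits = sum(true in ranked for true, ranked in zip(y_true, ranked_predictions))
    return hits / len(y_true)


def mean_reciprocal_rank(y_true, ranked_predictions):
    """Average of 1/rank of the true journal; 0 when it is not recommended."""
    total = 0.0
    for true, ranked in zip(y_true, ranked_predictions):
        if true in ranked:
            total += 1.0 / (ranked.index(true) + 1)
    return total / len(y_true)


# Toy labels, invented for illustration only.
train_labels = ["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"]
test_labels = ["A", "B", "C", "D"]
baseline = top_k_baseline(train_labels)
predictions = [baseline] * len(test_labels)  # same three journals for every paper
print(baseline)
print(top_k_accuracy(test_labels, predictions))
print(mean_reciprocal_rank(test_labels, predictions))
```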
The tested methods could be split into two contexts:
For both representations of the documents, I define three methodologies to create a document embedding and then a journal embedding such as:
Considering the text of the documents of the training set, I define
After this first step, I have an embedding for each document. Finally, I average the embeddings of all documents published in the same Frontiers journal to obtain a journal embedding.
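The averaging step can be sketched as follows. It is a minimal illustration in plain Python with toy two-dimensional vectors; the function names are hypothetical and the real embeddings (TFIDF, Word2Vec, or SBERT) have many more dimensions.

```python
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]


def journal_embeddings(doc_embeddings, journals):
    """Average the document embeddings of each journal into one vector."""
    grouped = defaultdict(list)
    for vector, journal in zip(doc_embeddings, journals):
        grouped[journal].append(vector)
    return {journal: mean_vector(vectors) for journal, vectors in grouped.items()}


# Toy embeddings, invented for illustration only.
docs = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
labels = ["Frontiers in Genetics", "Frontiers in Genetics", "Frontiers in Physics"]
centroids = journal_embeddings(docs, labels)
print(centroids)
```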
Considering the list of keywords extracted from each document, I define the document embedding as the average of the embeddings associated with each keyword. The embedding associated with each keyword is defined using:
After these steps, I have an embedding for each document. Finally, I average the embeddings of all documents published in the same Frontiers journal to obtain the journal embeddings.
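The keyword-based document embedding, and one way to turn journal embeddings into three recommendations, can be sketched as below. Ranking journals by cosine similarity is an assumption of this sketch, since the text above does not name the similarity measure; the function names and the toy vectors are also hypothetical.

```python
import math


def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(keywords, keyword_vectors, journal_vectors, k=3):
    """Rank journals by cosine similarity to the averaged keyword embedding."""
    # Keywords without a known vector are skipped.
    known = [keyword_vectors[w] for w in keywords if w in keyword_vectors]
    if not known:
        return []
    doc = average(known)
    ranked = sorted(
        journal_vectors,
        key=lambda journal: cosine(doc, journal_vectors[journal]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors, invented for illustration only.
keyword_vectors = {"gene": [1.0, 0.0], "genome": [0.8, 0.2], "laser": [0.0, 1.0]}
journal_vectors = {
    "Frontiers in Genetics": [0.9, 0.1],
    "Frontiers in Physics": [0.1, 0.9],
    "Frontiers in Chemistry": [0.5, 0.5],
}
print(recommend(["gene", "genome", "unknown"], keyword_vectors, journal_vectors))
```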
Here I show the performance of the tested models.
df_train_embeddings = create_embeddings_document(df_train, "tfidf")
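def get_tsne(df):
    """Project the embeddings to 2D with t-SNE and plot them coloured by journal."""
    # Note: recent scikit-learn releases rename n_iter to max_iter
    tsne = TSNE(n_components=2, perplexity=30, n_iter=1000, random_state=42)
    tsne_result = tsne.fit_transform(np.stack(df["embeddings"], axis=0))
    # Copy so that the new columns are not written into a view of df
    df_tsne = df[["journal"]].copy()
    df_tsne["tsne-2d-one"] = tsne_result[:, 0]
    df_tsne["tsne-2d-two"] = tsne_result[:, 1]
    fig = px.scatter(df_tsne, x="tsne-2d-one", y="tsne-2d-two", color="journal")
    fig.show()


get_tsne(df_train_embeddings)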
model = IO(filename="evaluation_document_tfidf",folder="05_report",format_="json").load()
print(f'Average accuracy: {model["accuracy_total"]}; MRR: {model["mean_reciprocal_rank"]}')
print(model["precision_recall_f1score"])
Average accuracy: 0.81; MRR: 0.68
precision recall f1-score support
Frontiers for Young Minds 1.00 1.00 1.00 6
Frontiers in Aging Neuroscience 0.67 0.80 0.73 10
Frontiers in Applied Mathematics and Statistics 0.00 0.00 0.00 2
Frontiers in Artificial Intelligence 0.00 0.00 0.00 1
Frontiers in Astronomy and Space Sciences 0.00 0.00 0.00 1
Frontiers in Behavioral Neuroscience 0.57 0.57 0.57 7
Frontiers in Big Data 0.00 0.00 0.00 1
Frontiers in Bioengineering and Biotechnology 0.76 0.72 0.74 18
Frontiers in Blockchain 0.00 0.00 0.00 2
Frontiers in Built Environment 0.00 0.00 0.00 3
Frontiers in Cardiovascular Medicine 0.75 0.75 0.75 4
Frontiers in Cell and Developmental Biology 0.55 0.80 0.65 15
Frontiers in Cellular Neuroscience 0.43 0.33 0.38 9
Frontiers in Cellular and Infection Microbiology 0.82 0.60 0.69 15
Frontiers in Chemistry 0.92 0.96 0.94 23
Frontiers in Communication 0.00 0.00 0.00 1
Frontiers in Computational Neuroscience 0.80 0.80 0.80 5
Frontiers in Earth Science 0.88 0.78 0.82 9
Frontiers in Ecology and Evolution 0.71 0.91 0.80 11
Frontiers in Education 1.00 0.75 0.86 4
Frontiers in Endocrinology 1.00 0.53 0.69 17
Frontiers in Energy Research 1.00 1.00 1.00 5
Frontiers in Environmental Science 0.67 0.67 0.67 3
Frontiers in Forests and Global Change 1.00 1.00 1.00 3
Frontiers in Genetics 0.78 0.88 0.83 33
Frontiers in Human Neuroscience 0.50 0.38 0.43 8
Frontiers in Immunology 0.90 0.89 0.89 61
Frontiers in Integrative Neuroscience 0.00 0.00 0.00 2
Frontiers in Marine Science 0.90 0.86 0.88 21
Frontiers in Materials 1.00 1.00 1.00 6
Frontiers in Mechanical Engineering 0.00 0.00 0.00 2
Frontiers in Medicine 0.55 0.86 0.67 14
Frontiers in Microbiology 0.91 0.92 0.92 79
Frontiers in Molecular Biosciences 0.67 0.33 0.44 6
Frontiers in Molecular Neuroscience 1.00 0.50 0.67 8
Frontiers in Neural Circuits 0.00 0.00 0.00 2
Frontiers in Neuroanatomy 0.00 0.00 0.00 1
Frontiers in Neuroinformatics 0.00 0.00 0.00 1
Frontiers in Neurology 0.78 0.84 0.81 25
Frontiers in Neurorobotics 1.00 0.67 0.80 3
Frontiers in Neuroscience 0.56 0.76 0.65 29
Frontiers in Nutrition 0.00 0.00 0.00 3
Frontiers in Oncology 0.86 0.98 0.91 43
Frontiers in Pediatrics 0.73 0.79 0.76 14
Frontiers in Pharmacology 0.78 0.91 0.84 57
Frontiers in Physics 0.47 0.82 0.60 11
Frontiers in Physiology 0.88 0.66 0.75 35
Frontiers in Plant Science 1.00 0.90 0.95 48
Frontiers in Psychiatry 1.00 1.00 1.00 28
Frontiers in Psychology 0.79 0.96 0.86 73
Frontiers in Public Health 1.00 0.70 0.82 10
Frontiers in Robotics and AI 0.80 0.80 0.80 5
Frontiers in Sociology 0.00 0.00 0.00 1
Frontiers in Sports and Active Living 1.00 0.33 0.50 3
Frontiers in Surgery 0.00 0.00 0.00 3
Frontiers in Sustainable Food Systems 1.00 0.25 0.40 4
Frontiers in Synaptic Neuroscience 0.00 0.00 0.00 1
Frontiers in Systems Neuroscience 0.00 0.00 0.00 3
Frontiers in Veterinary Science 0.93 0.93 0.93 15
accuracy 0.81 833
macro avg 0.58 0.54 0.55 833
weighted avg 0.80 0.81 0.79 833
model = IO(filename="evaluation_document_word2vec",folder="05_report",format_="json").load()
print(f'Average accuracy: {model["accuracy_total"]}; MRR: {model["mean_reciprocal_rank"]}')
print(model["precision_recall_f1score"])
Average accuracy: 0.66; MRR: 0.52
precision recall f1-score support
Frontiers for Young Minds 1.00 1.00 1.00 6
Frontiers in Aging Neuroscience 0.44 0.40 0.42 10
Frontiers in Applied Mathematics and Statistics 0.40 1.00 0.57 2
Frontiers in Artificial Intelligence 0.50 1.00 0.67 1
Frontiers in Astronomy and Space Sciences 1.00 1.00 1.00 1
Frontiers in Behavioral Neuroscience 0.62 0.71 0.67 7
Frontiers in Big Data 0.00 0.00 0.00 1
Frontiers in Bioengineering and Biotechnology 0.73 0.44 0.55 18
Frontiers in Blockchain 1.00 0.50 0.67 2
Frontiers in Built Environment 0.38 1.00 0.55 3
Frontiers in Cardiovascular Medicine 0.29 0.50 0.36 4
Frontiers in Cell and Developmental Biology 0.34 0.87 0.49 15
Frontiers in Cellular Neuroscience 0.57 0.89 0.70 9
Frontiers in Cellular and Infection Microbiology 0.75 0.80 0.77 15
Frontiers in Chemistry 0.82 0.78 0.80 23
Frontiers in Communication 1.00 1.00 1.00 1
Frontiers in Computational Neuroscience 0.44 0.80 0.57 5
Frontiers in Earth Science 0.82 1.00 0.90 9
Frontiers in Ecology and Evolution 0.60 0.82 0.69 11
Frontiers in Education 0.30 0.75 0.43 4
Frontiers in Endocrinology 0.60 0.71 0.65 17
Frontiers in Energy Research 0.62 1.00 0.77 5
Frontiers in Environmental Science 0.20 0.33 0.25 3
Frontiers in Forests and Global Change 0.50 1.00 0.67 3
Frontiers in Genetics 0.72 0.55 0.62 33
Frontiers in Human Neuroscience 0.35 0.88 0.50 8
Frontiers in Immunology 0.90 0.85 0.87 61
Frontiers in Integrative Neuroscience 0.00 0.00 0.00 2
Frontiers in Marine Science 0.85 0.52 0.65 21
Frontiers in Materials 0.86 1.00 0.92 6
Frontiers in Mechanical Engineering 0.50 1.00 0.67 2
Frontiers in Medicine 0.38 0.43 0.40 14
Frontiers in Microbiology 0.92 0.70 0.79 79
Frontiers in Molecular Biosciences 0.36 0.67 0.47 6
Frontiers in Molecular Neuroscience 0.56 0.62 0.59 8
Frontiers in Neural Circuits 0.20 0.50 0.29 2
Frontiers in Neuroanatomy 0.00 0.00 0.00 1
Frontiers in Neuroinformatics 0.00 0.00 0.00 1
Frontiers in Neurology 0.67 0.64 0.65 25
Frontiers in Neurorobotics 0.15 0.67 0.25 3
Frontiers in Neuroscience 0.58 0.24 0.34 29
Frontiers in Nutrition 0.00 0.00 0.00 3
Frontiers in Oncology 0.84 0.88 0.86 43
Frontiers in Pediatrics 0.44 0.50 0.47 14
Frontiers in Pharmacology 0.92 0.63 0.75 57
Frontiers in Physics 0.80 0.36 0.50 11
Frontiers in Physiology 0.71 0.29 0.41 35
Frontiers in Plant Science 0.97 0.71 0.82 48
Frontiers in Psychiatry 0.65 0.54 0.59 28
Frontiers in Psychology 0.93 0.78 0.85 73
Frontiers in Public Health 0.57 0.40 0.47 10
Frontiers in Robotics and AI 0.67 0.80 0.73 5
Frontiers in Sociology 0.33 1.00 0.50 1
Frontiers in Sports and Active Living 0.50 0.67 0.57 3
Frontiers in Surgery 0.00 0.00 0.00 3
Frontiers in Sustainable Food Systems 0.50 0.50 0.50 4
Frontiers in Synaptic Neuroscience 0.25 1.00 0.40 1
Frontiers in Systems Neuroscience 0.50 0.33 0.40 3
Frontiers in Veterinary Science 0.69 0.73 0.71 15
accuracy 0.66 833
macro avg 0.55 0.64 0.55 833
weighted avg 0.74 0.66 0.68 833
model = IO(filename="evaluation_document_sbert",folder="05_report",format_="json").load()
print(f'Average accuracy: {model["accuracy_total"]}; MRR: {model["mean_reciprocal_rank"]}')
print(model["precision_recall_f1score"])
Average accuracy: 0.83; MRR: 0.71
precision recall f1-score support
Frontiers for Young Minds 1.00 0.33 0.50 6
Frontiers in Aging Neuroscience 0.70 0.70 0.70 10
Frontiers in Applied Mathematics and Statistics 1.00 1.00 1.00 2
Frontiers in Artificial Intelligence 1.00 1.00 1.00 1
Frontiers in Astronomy and Space Sciences 1.00 1.00 1.00 1
Frontiers in Behavioral Neuroscience 0.56 0.71 0.63 7
Frontiers in Big Data 0.50 1.00 0.67 1
Frontiers in Bioengineering and Biotechnology 0.80 0.67 0.73 18
Frontiers in Blockchain 1.00 1.00 1.00 2
Frontiers in Built Environment 1.00 1.00 1.00 3
Frontiers in Cardiovascular Medicine 0.57 1.00 0.73 4
Frontiers in Cell and Developmental Biology 0.56 1.00 0.71 15
Frontiers in Cellular Neuroscience 0.62 0.89 0.73 9
Frontiers in Cellular and Infection Microbiology 0.67 0.93 0.78 15
Frontiers in Chemistry 0.95 0.87 0.91 23
Frontiers in Communication 1.00 1.00 1.00 1
Frontiers in Computational Neuroscience 0.60 0.60 0.60 5
Frontiers in Earth Science 0.90 1.00 0.95 9
Frontiers in Ecology and Evolution 0.85 1.00 0.92 11
Frontiers in Education 1.00 0.75 0.86 4
Frontiers in Endocrinology 0.94 0.94 0.94 17
Frontiers in Energy Research 0.83 1.00 0.91 5
Frontiers in Environmental Science 1.00 1.00 1.00 3
Frontiers in Forests and Global Change 1.00 1.00 1.00 3
Frontiers in Genetics 0.78 0.85 0.81 33
Frontiers in Human Neuroscience 0.40 0.75 0.52 8
Frontiers in Immunology 0.97 0.93 0.95 61
Frontiers in Integrative Neuroscience 0.00 0.00 0.00 2
Frontiers in Marine Science 0.90 0.86 0.88 21
Frontiers in Materials 1.00 1.00 1.00 6
Frontiers in Mechanical Engineering 1.00 1.00 1.00 2
Frontiers in Medicine 0.67 0.71 0.69 14
Frontiers in Microbiology 0.94 0.85 0.89 79
Frontiers in Molecular Biosciences 0.50 0.83 0.62 6
Frontiers in Molecular Neuroscience 0.40 1.00 0.57 8
Frontiers in Neural Circuits 0.20 0.50 0.29 2
Frontiers in Neuroanatomy 1.00 1.00 1.00 1
Frontiers in Neuroinformatics 0.33 1.00 0.50 1
Frontiers in Neurology 0.88 0.88 0.88 25
Frontiers in Neurorobotics 0.75 1.00 0.86 3
Frontiers in Neuroscience 0.62 0.28 0.38 29
Frontiers in Nutrition 0.50 0.67 0.57 3
Frontiers in Oncology 0.91 1.00 0.96 43
Frontiers in Pediatrics 0.85 0.79 0.81 14
Frontiers in Pharmacology 0.92 0.77 0.84 57
Frontiers in Physics 1.00 0.82 0.90 11
Frontiers in Physiology 0.81 0.49 0.61 35
Frontiers in Plant Science 0.98 0.92 0.95 48
Frontiers in Psychiatry 1.00 0.93 0.96 28
Frontiers in Psychology 0.92 0.90 0.91 73
Frontiers in Public Health 0.89 0.80 0.84 10
Frontiers in Robotics and AI 1.00 0.40 0.57 5
Frontiers in Sociology 1.00 1.00 1.00 1
Frontiers in Sports and Active Living 0.60 1.00 0.75 3
Frontiers in Surgery 0.67 0.67 0.67 3
Frontiers in Sustainable Food Systems 1.00 1.00 1.00 4
Frontiers in Synaptic Neuroscience 1.00 1.00 1.00 1
Frontiers in Systems Neuroscience 0.67 0.67 0.67 3
Frontiers in Veterinary Science 0.94 1.00 0.97 15
accuracy 0.83 833
macro avg 0.80 0.84 0.80 833
weighted avg 0.86 0.83 0.83 833
model = IO(filename="evaluation_keywords_tfidf",folder="05_report",format_="json").load()
print(f'Average accuracy: {model["accuracy_total"]}; MRR: {model["mean_reciprocal_rank"]}')
print(model["precision_recall_f1score"])
Average accuracy: 0.56; MRR: 0.42
precision recall f1-score support
Frontiers for Young Minds 0.00 0.00 0.00 6
Frontiers in Aging Neuroscience 0.35 0.70 0.47 10
Frontiers in Applied Mathematics and Statistics 0.00 0.00 0.00 2
Frontiers in Artificial Intelligence 0.00 0.00 0.00 1
Frontiers in Astronomy and Space Sciences 0.00 0.00 0.00 1
Frontiers in Behavioral Neuroscience 0.20 0.29 0.24 7
Frontiers in Big Data 0.00 0.00 0.00 1
Frontiers in Bioengineering and Biotechnology 0.39 0.39 0.39 18
Frontiers in Blockchain 0.40 1.00 0.57 2
Frontiers in Built Environment 0.00 0.00 0.00 3
Frontiers in Cardiovascular Medicine 0.10 0.25 0.14 4
Frontiers in Cell and Developmental Biology 0.60 0.40 0.48 15
Frontiers in Cellular Neuroscience 0.00 0.00 0.00 9
Frontiers in Cellular and Infection Microbiology 0.36 0.27 0.31 15
Frontiers in Chemistry 0.65 0.48 0.55 23
Frontiers in Communication 0.33 1.00 0.50 1
Frontiers in Computational Neuroscience 0.30 0.60 0.40 5
Frontiers in Earth Science 0.67 0.44 0.53 9
Frontiers in Ecology and Evolution 0.25 0.18 0.21 11
Frontiers in Education 0.33 0.50 0.40 4
Frontiers in Endocrinology 0.75 0.35 0.48 17
Frontiers in Energy Research 0.25 0.60 0.35 5
Frontiers in Environmental Science 0.17 0.33 0.22 3
Frontiers in Forests and Global Change 0.33 0.67 0.44 3
Frontiers in Genetics 0.53 0.70 0.61 33
Frontiers in Human Neuroscience 0.22 0.25 0.24 8
Frontiers in Immunology 0.69 0.85 0.76 61
Frontiers in Integrative Neuroscience 0.00 0.00 0.00 2
Frontiers in Marine Science 0.81 0.62 0.70 21
Frontiers in Materials 0.40 0.33 0.36 6
Frontiers in Mechanical Engineering 0.00 0.00 0.00 2
Frontiers in Medicine 0.57 0.29 0.38 14
Frontiers in Microbiology 0.84 0.67 0.75 79
Frontiers in Molecular Biosciences 0.09 0.17 0.12 6
Frontiers in Molecular Neuroscience 0.14 0.12 0.13 8
Frontiers in Neural Circuits 0.40 1.00 0.57 2
Frontiers in Neuroanatomy 0.00 0.00 0.00 1
Frontiers in Neuroinformatics 0.00 0.00 0.00 1
Frontiers in Neurology 0.56 0.56 0.56 25
Frontiers in Neurorobotics 0.40 0.67 0.50 3
Frontiers in Neuroscience 0.60 0.52 0.56 29
Frontiers in Nutrition 0.29 0.67 0.40 3
Frontiers in Oncology 0.72 0.77 0.74 43
Frontiers in Pediatrics 0.56 0.64 0.60 14
Frontiers in Pharmacology 0.70 0.49 0.58 57
Frontiers in Physics 0.29 0.45 0.36 11
Frontiers in Physiology 0.65 0.49 0.56 35
Frontiers in Plant Science 0.78 0.60 0.68 48
Frontiers in Psychiatry 0.71 0.79 0.75 28
Frontiers in Psychology 0.81 0.70 0.75 73
Frontiers in Public Health 0.45 0.50 0.48 10
Frontiers in Robotics and AI 0.60 0.60 0.60 5
Frontiers in Sociology 0.00 0.00 0.00 1
Frontiers in Sports and Active Living 0.00 0.00 0.00 3
Frontiers in Surgery 0.20 0.33 0.25 3
Frontiers in Sustainable Food Systems 0.33 0.25 0.29 4
Frontiers in Synaptic Neuroscience 0.20 1.00 0.33 1
Frontiers in Systems Neuroscience 0.25 0.33 0.29 3
Frontiers in Veterinary Science 0.67 0.53 0.59 15
accuracy 0.56 833
macro avg 0.35 0.41 0.36 833
weighted avg 0.60 0.56 0.57 833
model = IO(filename="evaluation_keywords_word2vec",folder="05_report",format_="json").load()
print(f'Average accuracy: {model["accuracy_total"]}; MRR: {model["mean_reciprocal_rank"]}')
print(model["precision_recall_f1score"])
Average accuracy: 0.64; MRR: 0.5
precision recall f1-score support
Frontiers for Young Minds 0.00 0.00 0.00 6
Frontiers in Aging Neuroscience 0.71 0.50 0.59 10
Frontiers in Applied Mathematics and Statistics 0.50 1.00 0.67 2
Frontiers in Artificial Intelligence 0.00 0.00 0.00 1
Frontiers in Astronomy and Space Sciences 0.50 1.00 0.67 1
Frontiers in Behavioral Neuroscience 0.40 0.57 0.47 7
Frontiers in Big Data 0.00 0.00 0.00 1
Frontiers in Bioengineering and Biotechnology 0.57 0.44 0.50 18
Frontiers in Blockchain 1.00 1.00 1.00 2
Frontiers in Built Environment 1.00 1.00 1.00 3
Frontiers in Cardiovascular Medicine 0.00 0.00 0.00 4
Frontiers in Cell and Developmental Biology 0.34 0.67 0.45 15
Frontiers in Cellular Neuroscience 0.33 0.33 0.33 9
Frontiers in Cellular and Infection Microbiology 0.52 0.73 0.61 15
Frontiers in Chemistry 0.75 0.65 0.70 23
Frontiers in Communication 0.00 0.00 0.00 1
Frontiers in Computational Neuroscience 0.33 0.40 0.36 5
Frontiers in Earth Science 1.00 0.67 0.80 9
Frontiers in Ecology and Evolution 0.56 0.82 0.67 11
Frontiers in Education 0.25 0.25 0.25 4
Frontiers in Endocrinology 0.54 0.41 0.47 17
Frontiers in Energy Research 0.62 1.00 0.77 5
Frontiers in Environmental Science 1.00 0.67 0.80 3
Frontiers in Forests and Global Change 1.00 1.00 1.00 3
Frontiers in Genetics 0.61 0.58 0.59 33
Frontiers in Human Neuroscience 0.40 0.75 0.52 8
Frontiers in Immunology 0.75 0.74 0.74 61
Frontiers in Integrative Neuroscience 0.00 0.00 0.00 2
Frontiers in Marine Science 0.92 0.57 0.71 21
Frontiers in Materials 0.55 1.00 0.71 6
Frontiers in Mechanical Engineering 1.00 0.50 0.67 2
Frontiers in Medicine 0.30 0.43 0.35 14
Frontiers in Microbiology 0.86 0.68 0.76 79
Frontiers in Molecular Biosciences 0.18 0.50 0.26 6
Frontiers in Molecular Neuroscience 0.33 0.62 0.43 8
Frontiers in Neural Circuits 0.33 0.50 0.40 2
Frontiers in Neuroanatomy 0.00 0.00 0.00 1
Frontiers in Neuroinformatics 0.00 0.00 0.00 1
Frontiers in Neurology 0.60 0.48 0.53 25
Frontiers in Neurorobotics 0.33 1.00 0.50 3
Frontiers in Neuroscience 0.53 0.28 0.36 29
Frontiers in Nutrition 0.29 0.67 0.40 3
Frontiers in Oncology 0.75 0.84 0.79 43
Frontiers in Pediatrics 0.71 0.71 0.71 14
Frontiers in Pharmacology 0.86 0.65 0.74 57
Frontiers in Physics 0.60 0.82 0.69 11
Frontiers in Physiology 0.83 0.57 0.68 35
Frontiers in Plant Science 0.95 0.73 0.82 48
Frontiers in Psychiatry 0.60 0.64 0.62 28
Frontiers in Psychology 0.80 0.82 0.81 73
Frontiers in Public Health 0.39 0.70 0.50 10
Frontiers in Robotics and AI 0.50 0.40 0.44 5
Frontiers in Sociology 0.00 0.00 0.00 1
Frontiers in Sports and Active Living 0.50 0.33 0.40 3
Frontiers in Surgery 0.25 0.33 0.29 3
Frontiers in Sustainable Food Systems 0.60 0.75 0.67 4
Frontiers in Synaptic Neuroscience 0.00 0.00 0.00 1
Frontiers in Systems Neuroscience 0.29 0.67 0.40 3
Frontiers in Veterinary Science 0.77 0.67 0.71 15
accuracy 0.64 833
macro avg 0.50 0.54 0.50 833
weighted avg 0.68 0.64 0.65 833
model = IO(filename="evaluation_keywords_sbert",folder="05_report",format_="json").load()
print(f'Average accuracy: {model["accuracy_total"]}; MRR: {model["mean_reciprocal_rank"]}')
print(model["precision_recall_f1score"])
Average accuracy: 0.73; MRR: 0.56
precision recall f1-score support
Frontiers for Young Minds 0.30 0.50 0.37 6
Frontiers in Aging Neuroscience 0.67 0.60 0.63 10
Frontiers in Applied Mathematics and Statistics 0.40 1.00 0.57 2
Frontiers in Artificial Intelligence 1.00 1.00 1.00 1
Frontiers in Astronomy and Space Sciences 1.00 1.00 1.00 1
Frontiers in Behavioral Neuroscience 0.58 1.00 0.74 7
Frontiers in Big Data 0.25 1.00 0.40 1
Frontiers in Bioengineering and Biotechnology 0.67 0.56 0.61 18
Frontiers in Blockchain 1.00 1.00 1.00 2
Frontiers in Built Environment 1.00 0.67 0.80 3
Frontiers in Cardiovascular Medicine 0.50 0.75 0.60 4
Frontiers in Cell and Developmental Biology 0.48 0.73 0.58 15
Frontiers in Cellular Neuroscience 0.57 0.44 0.50 9
Frontiers in Cellular and Infection Microbiology 0.55 0.80 0.65 15
Frontiers in Chemistry 0.86 0.83 0.84 23
Frontiers in Communication 1.00 1.00 1.00 1
Frontiers in Computational Neuroscience 0.36 0.80 0.50 5
Frontiers in Earth Science 0.90 1.00 0.95 9
Frontiers in Ecology and Evolution 0.80 0.73 0.76 11
Frontiers in Education 1.00 0.50 0.67 4
Frontiers in Endocrinology 0.62 0.47 0.53 17
Frontiers in Energy Research 1.00 0.60 0.75 5
Frontiers in Environmental Science 0.67 0.67 0.67 3
Frontiers in Forests and Global Change 0.75 1.00 0.86 3
Frontiers in Genetics 0.85 0.67 0.75 33
Frontiers in Human Neuroscience 0.46 0.75 0.57 8
Frontiers in Immunology 0.75 0.85 0.80 61
Frontiers in Integrative Neuroscience 0.00 0.00 0.00 2
Frontiers in Marine Science 1.00 0.81 0.89 21
Frontiers in Materials 0.75 1.00 0.86 6
Frontiers in Mechanical Engineering 1.00 1.00 1.00 2
Frontiers in Medicine 0.58 0.50 0.54 14
Frontiers in Microbiology 0.90 0.81 0.85 79
Frontiers in Molecular Biosciences 0.31 0.67 0.42 6
Frontiers in Molecular Neuroscience 0.36 0.50 0.42 8
Frontiers in Neural Circuits 0.50 1.00 0.67 2
Frontiers in Neuroanatomy 0.00 0.00 0.00 1
Frontiers in Neuroinformatics 0.00 0.00 0.00 1
Frontiers in Neurology 0.57 0.48 0.52 25
Frontiers in Neurorobotics 0.75 1.00 0.86 3
Frontiers in Neuroscience 0.53 0.31 0.39 29
Frontiers in Nutrition 0.33 0.67 0.44 3
Frontiers in Oncology 0.79 0.88 0.84 43
Frontiers in Pediatrics 0.82 0.64 0.72 14
Frontiers in Pharmacology 0.75 0.70 0.73 57
Frontiers in Physics 0.70 0.64 0.67 11
Frontiers in Physiology 0.76 0.54 0.63 35
Frontiers in Plant Science 0.91 0.83 0.87 48
Frontiers in Psychiatry 0.70 0.75 0.72 28
Frontiers in Psychology 0.88 0.84 0.86 73
Frontiers in Public Health 0.58 0.70 0.64 10
Frontiers in Robotics and AI 1.00 0.60 0.75 5
Frontiers in Sociology 0.00 0.00 0.00 1
Frontiers in Sports and Active Living 0.60 1.00 0.75 3
Frontiers in Surgery 0.50 0.67 0.57 3
Frontiers in Sustainable Food Systems 0.75 0.75 0.75 4
Frontiers in Synaptic Neuroscience 1.00 1.00 1.00 1
Frontiers in Systems Neuroscience 0.33 0.67 0.44 3
Frontiers in Veterinary Science 0.75 0.80 0.77 15
accuracy 0.73 833
macro avg 0.65 0.71 0.66 833
weighted avg 0.75 0.73 0.73 833
The best performance is obtained with SBERT document embeddings built from the full text (accuracy 0.83, MRR 0.71). However, the TFIDF approach on the full text performs similarly (accuracy 0.81, MRR 0.68). The method to use in a production environment depends on the application. If accuracy is the main goal, SBERT is the right choice, although it is computationally expensive. TFIDF is faster, but it needs a large amount of RAM to load its large vectors.
Another interesting result is the comparison between TFIDF and Word2Vec. As previously described by Meijer et al. (Document Embedding for Scientific Articles: Efficacy of Word Embeddings vs TFIDF. 2021), TFIDF performs better than Word2Vec when the entire document is used (accuracy 0.81 vs 0.66), while with keywords the result is inverted (0.64 for Word2Vec vs 0.56 for TFIDF). What is really unexpected is that the SBERT approach works better than both other methods in both settings.
Finally, the best model is deployed as a REST API with FastAPI and accessed through a very simple Streamlit web interface.
The chosen model is SBERT, because it performs better than the others. However, it is also the slowest one. For production, the options to evaluate are deploying it on a GPU server or switching to another model (such as TFIDF).